Then, we use efficient XNOR and Bit-count operations to replace real-valued operations.

Following [199], the forward process of the BNN is

$$a_i = b^a_{i-1} \odot b^w_i, \qquad (6.7)$$

where $\odot$ represents the efficient XNOR and Bit-count operations. Based on XNOR-Net, we

introduce a learnable channel-wise scale factor to modulate the amplitude of real-valued

convolution. Aligned with the Batch Normalization (BN) and activation layers, the 1-bit

convolution is formulated as

$$b^a_i = \mathrm{sign}\big(\Phi(\alpha_i \circ b^a_{i-1} \odot b^w_i)\big). \qquad (6.8)$$

In KR-GAL, the original output feature $a_i$ is first scaled by a channel-wise scale factor (vector) $\alpha_i \in \mathbb{R}^{C_i}$ to modulate the amplitude of the real-valued counterparts. It then enters $\Phi(\cdot)$, which represents a composite function built by stacking several layers, e.g., a BN layer, a non-linear activation layer, and a max-pooling layer. The output is then binarized with the sign function to obtain the binary activations $b^a_i \in \mathbb{B}^{n_i}$, where $\mathrm{sign}(\cdot)$ returns $+1$ if the input is greater than zero and $-1$ otherwise. The 1-bit activation $b^a_i$ can then be used for the efficient XNOR and Bit-count operations of the $(i+1)$-th layer.
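To make the dataflow of Eqs. (6.7)-(6.8) concrete, the following is a minimal PyTorch sketch of a 1-bit convolution layer. It is not the reference implementation: the XNOR and Bit-count kernel is emulated by a standard convolution over $\{-1,+1\}$ tensors, $\Phi(\cdot)$ is instantiated here as BN followed by ReLU, and the names (`SignSTE`, `BinaryConv2d`) and the straight-through estimator used for backpropagation are our assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class SignSTE(torch.autograd.Function):
    """sign(.) returning +1/-1, with a straight-through estimator backward."""

    @staticmethod
    def forward(ctx, x):
        ctx.save_for_backward(x)
        return torch.where(x > 0, torch.ones_like(x), -torch.ones_like(x))

    @staticmethod
    def backward(ctx, grad_output):
        (x,) = ctx.saved_tensors
        # Clipped STE: pass gradients only where |x| <= 1.
        return grad_output * (x.abs() <= 1).to(grad_output.dtype)


class BinaryConv2d(nn.Module):
    """1-bit convolution of Eq. (6.8): b^a_i = sign(Phi(alpha_i o (b^a_{i-1} xnor b^w_i)))."""

    def __init__(self, in_ch, out_ch, kernel_size=3, stride=1, padding=1):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(out_ch, in_ch, kernel_size, kernel_size) * 0.01)
        # Learnable channel-wise scale factor alpha_i in R^{C_i}.
        self.alpha = nn.Parameter(torch.ones(out_ch))
        self.bn = nn.BatchNorm2d(out_ch)
        self.stride, self.padding = stride, padding

    def forward(self, b_a_prev):
        b_w = SignSTE.apply(self.weight)  # binarized kernels b^w_i
        # Over {-1,+1} tensors, XNOR + Bit-count is numerically a plain convolution (Eq. 6.7).
        a = F.conv2d(b_a_prev, b_w, stride=self.stride, padding=self.padding)
        a = a * self.alpha.view(1, -1, 1, 1)   # amplitude modulation by alpha_i
        a = F.relu(self.bn(a))                 # Phi(.): BN + non-linear activation
        return SignSTE.apply(a)                # 1-bit activations for the (i+1)-th layer
```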

However, the gap in representational capability between $w_i$ and $b^w_i$ could lead to a large quantization error. We aim to minimize this gap to reduce the quantization error while increasing the binarized kernels' ability to provide information gains. Therefore, $\alpha_i$ is also used to reconstruct $w_i$ from $b^w_i$. This learnable scale factor leads to a learning process that estimates the convolutional filters more precisely by minimizing an adversarial loss. Discriminators $D(\cdot)$ with weights $W_D$ are introduced to distinguish the unbinarized kernels $w_i$ from the reconstructed ones $\alpha_i \circ b^w_i$. Therefore, $\alpha_i$ and $W_D$ are learned by solving the following optimization problem:

$$\arg\min_{w_i,\, b^w_i,\, \alpha_i}\ \max_{W_D}\ \mathcal{L}^K_{Adv}(w_i, b^w_i, \alpha_i, W_D) + \mathcal{L}^K_{MSE}(w_i, b^w_i, \alpha_i), \quad i \in \{1, \ldots, N\}, \qquad (6.9)$$

where $\mathcal{L}^K_{Adv}(w_i, b^w_i, \alpha_i, W_D)$ is the adversarial loss, defined as

$$\mathcal{L}^K_{Adv}(w_i, b^w_i, \alpha_i, W_D) = \log(D(w_i; W_D)) + \log(1 - D(b^w_i \circ \alpha_i; W_D)), \qquad (6.10)$$

where $D(\cdot)$ consists of several basic blocks, each with a fully connected layer and a LeakyReLU layer. In addition, we employ discriminators to refine every binarized convolution layer during the binarization training process.
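As an illustration of how the min-max game of Eqs. (6.9)-(6.10) can be set up, the sketch below implements a kernel discriminator and the adversarial loss in PyTorch. The hidden width, the sigmoid output, and the per-output-channel flattening of the kernels are our assumptions; in training, the discriminator weights $W_D$ maximize this loss while $w_i$ and $\alpha_i$ minimize it, with the two updates alternated.

```python
import torch
import torch.nn as nn


class KernelDiscriminator(nn.Module):
    """D(.; W_D): stacked blocks of a fully connected layer followed by LeakyReLU."""

    def __init__(self, num_kernel_params, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(num_kernel_params, hidden), nn.LeakyReLU(0.2),
            nn.Linear(hidden, hidden), nn.LeakyReLU(0.2),
            nn.Linear(hidden, 1), nn.Sigmoid(),
        )

    def forward(self, kernels_flat):
        return self.net(kernels_flat)


def kr_gal_adv_loss(D, w_i, b_w_i, alpha_i, eps=1e-8):
    """Eq. (6.10): log D(w_i; W_D) + log(1 - D(b^w_i o alpha_i; W_D)).

    Each output-channel filter is treated as one sample for D.
    """
    real = D(w_i.flatten(1))                                   # unbinarized kernels
    fake = D((alpha_i.view(-1, 1, 1, 1) * b_w_i).flatten(1))   # reconstructed kernels
    return torch.log(real + eps).mean() + torch.log(1.0 - fake + eps).mean()
```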

Furthermore, $\mathcal{L}^K_{MSE}(w_i, b^w_i, \alpha_i)$ is the kernel loss between the learned real-valued filters $w_i$ and the binarized filters $b^w_i$, which is expressed by MSE as

$$\mathcal{L}^K_{MSE}(w_i, b^w_i, \alpha_i) = \frac{\lambda}{2}\,\|w_i - \alpha_i \circ b^w_i\|_2^2, \qquad (6.11)$$

where the MSE is used to narrow the gap between the real-valued $w_i$ and the binarized $b^w_i$, and $\lambda$ is a balancing hyperparameter.
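The kernel reconstruction term of Eq. (6.11) is straightforward; a possible PyTorch form is shown below, where the value of $\lambda$ is only a placeholder. During training, this term would be added to the adversarial loss of Eq. (6.10) in the generator's minimization step, per Eq. (6.9).

```python
def kernel_mse_loss(w_i, b_w_i, alpha_i, lam=1e-4):
    """Eq. (6.11): (lambda / 2) * || w_i - alpha_i o b^w_i ||_2^2."""
    recon = alpha_i.view(-1, 1, 1, 1) * b_w_i   # channel-wise rescaled binary kernels
    return 0.5 * lam * (w_i - recon).pow(2).sum()
```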

6.2.3 Feature Refining Generative Adversarial Learning (FR-GAL)

We introduce generative adversarial learning (GAL) to refine the low-level features through self-supervision. We employ the high-level feature with abundant semantic information $a_H \in \mathbb{R}^{m_H}$ to supervise the low-level feature $a_L \in \mathbb{R}^{m_L}$, where $m_H = C_H \cdot W_H \cdot H_H$ and $m_L = C_L \cdot W_L \cdot H_L$. To keep the channel dimension identical, we first employ a $1 \times 1$ convolution to reduce $C_H$ to $C_L$ as

$$a'_H = f(W_{1\times 1} \otimes a_H), \qquad (6.12)$$
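A minimal sketch of the channel alignment in Eq. (6.12) is given below: a 1 × 1 convolution reduces the $C_H$ channels of the high-level feature to $C_L$. The excerpt does not specify $f(\cdot)$, so a ReLU is assumed here, and the module name is ours.

```python
import torch.nn as nn


class ChannelAlign(nn.Module):
    """Eq. (6.12): a'_H = f(W_{1x1} conv a_H), reducing C_H channels to C_L."""

    def __init__(self, c_high, c_low):
        super().__init__()
        self.conv1x1 = nn.Conv2d(c_high, c_low, kernel_size=1, bias=False)
        self.f = nn.ReLU(inplace=True)  # f(.): assumed nonlinearity, not specified in the text

    def forward(self, a_high):
        # a_high: (N, C_H, H_H, W_H) high-level feature map
        return self.f(self.conv1x1(a_high))
```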